A Template Discovery Algorithm by Substring Amplification
Abstract
In this paper, we consider the problem of finding a set of substrings common to a given set of strings. We define this as the template discovery problem: given a set of strings generated by some fixed but unknown pattern, find the constant parts of the pattern. A pattern is a string over constant and variable symbols; it generates strings by replacing its variables with constant strings. We assume that the frequencies of the substituted strings follow a power-law distribution. Although the longest common subsequence problem, one of the best-known common-part discovery problems, is known to be NP-complete, we show that the template discovery problem can be solved in linear time with high probability. This complexity is achieved through the following contributions: a reformulation of the problem, the use of a set of substrings to represent a string, and counting all occurrences F(f) with frequency f instead of just the frequency f. We demonstrate the effectiveness of the proposed algorithm on data from the Web. Moreover, we show that it is robust to noise and remains effective even when the input strings are generated by a union of patterns or by a pattern with the iteration operation.
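To make the setting concrete, here is a minimal Python sketch of the ideas named in the abstract. It is a naive illustration, not the paper's linear-time algorithm. It assumes that substrings are fixed-length k-grams, that the "frequency" of a substring is the number of input strings containing it, and that F(f) is the number of distinct substrings with frequency f. The function names, the example pattern, and the choice k = 4 are all hypothetical.

```python
from collections import Counter

def instantiate(pattern, values):
    """Generate a string from a pattern: None marks a variable slot,
    which is replaced by the next constant string in `values`."""
    it = iter(values)
    return "".join(next(it) if part is None else part for part in pattern)

def kgram_frequencies(strings, k):
    """Map each length-k substring to the number of input strings containing it.
    (Assumption: frequency is counted per string, not per occurrence.)"""
    freq = Counter()
    for s in strings:
        # Use a set so each string contributes at most once per substring.
        seen = {s[i:i + k] for i in range(len(s) - k + 1)}
        freq.update(seen)
    return freq

def frequency_of_frequencies(freq):
    """F(f): the number of distinct substrings whose frequency is exactly f."""
    return Counter(freq.values())

# Hypothetical pattern with one variable between two constant parts.
pattern = ["<li>", None, "</li>"]
strings = [instantiate(pattern, [v]) for v in ["apple", "kiwi", "fig"]]

freq = kgram_frequencies(strings, 4)
F = frequency_of_frequencies(freq)

# Substrings present in every input string are candidates for constant parts.
candidates = sorted(g for g, f in freq.items() if f == len(strings))
```

In this toy input, the k-grams that occur in all three strings come only from the constant parts `<li>` and `</li>`, while k-grams touching the substituted values each occur in a single string. The histogram `F` therefore separates the two groups: a few substrings at the maximum frequency and many at frequency 1.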
Similar resources
Sparse Substring Pattern Set Discovery Using Linear Programming Boosting
In this paper, we consider finding a small set of substring patterns which classifies the given documents well. We formulate the problem as 1 norm soft margin optimization problem where each dimension corresponds to a substring pattern. Then we solve this problem by using LPBoost and an optimal substring discovery algorithm. Since the problem is a linear program, the resulting solution is likel...
Algorithmic Program Synthesis
To solve a problem with a dynamic programming algorithm, one must reformulate the problem such that its solution can be formed from solutions to overlapping subproblems. Because overlapping subproblems may not be apparent in the specification, it is desirable to obtain the algorithm directly from the specification. We describe a semi-automatic synthesizer of linear-time dynamic programming algo...
A New RSTB Invariant Image Template Matching Based on Log-Spectrum and Modified ICA
Template matching is a widely used technique in many image processing and machine vision applications. In this paper we propose a new, fast, and reliable template matching algorithm which is invariant to Rotation, Scale, Translation and Brightness (RSTB) changes. For this purpose, we adopt the idea of the ring projection transform (RPT) of an image. In the proposed algorithm, two novel s...
A Practical Algorithm to Find the Best Episode Patterns
An episode pattern is a generalization of a subsequence pattern in which the length of the substring containing the subsequence is bounded. Given two sets of strings, consider the optimization problem of finding a best episode pattern that is common in one set but not in the other. The problem is known to be NP-hard. We give a practical algorithm that solves it exactly.
PROSIDIS: A Special Purpose Processor for PROtein SImilarity DIScovery
This work presents the architecture of PROSIDIS, a special purpose processor designed to search for the occurrence of substrings similar to a given ‘template string’ within a proteome. The paper recalls the basis of the PHG tool, developed in the framework of the HADES project, which automatically designs a parallel hardware starting from recurrence equations. In this work we present a special ...
Publication date: 2004